5 research outputs found

Minimalistic Approach to Coreference Resolution in Lithuanian Medical Records

Author: Marcin Woźniak
Rimantas Butleris
Rita Butkienė
Robertas Damaševičius
Rytis Maskeliūnas
Voldemaras Žitkus
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2019

Coreference resolution is a challenging part of natural language processing (NLP) with applications in machine translation, semantic search and other information retrieval, and decision support systems. Coreference resolution requires linguistic preprocessing and rich language resources for automatically identifying and resolving coreferring expressions. Many rarer and under-resourced languages (such as Lithuanian) lack the required language resources and tools. We present a method for coreference resolution in the Lithuanian language and its application to processing e-health records from a hospital reception. Our novelty is the ability to process coreferences with minimal linguistic resources, which is important in linguistic applications for rare and endangered languages. The experimental results show that coreference resolution is applicable to the development of NLP-powered online healthcare services in Lithuania.

KTUePubl (Repository of Kaunas University of Technology)

Directory of Open Access Journals

Minimalistic Approach to Coreference Resolution in Lithuanian Medical Records

Author: Marcin Woźniak
Rimantas Butleris
Rita Butkienė
Robertas Damaševičius
Rytis Maskeliūnas
Voldemaras Žitkus
Publication venue: 'Hindawi Limited'

Crossref

Coreference in Universal Dependencies 1.0 (CorefUD 1.0)

Author: Bourgonje Peter
Cinková Silvie
Hajič Jan
Hardmeier Christian
Krielke Pauline
Landragin Frédéric
Lapshinova-Koltunski Ekaterina
Martí M. Antònia
Mikulová Marie
Nedoluzhko Anna
Novák Michal
Ogrodniczuk Maciej
Popel Martin
Recasens Marta
Stede Manfred
Straka Milan
Toldova Svetlana
Vincze Veronika
Zeldes Amir
Zeman Daniel
Žabokrtský Zdeněk
Žitkus Voldemaras
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 06/04/2022

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.0 consists of 17 datasets for 11 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 13 datasets for 10 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 1 for Hungarian, 1 for Lithuanian, 1 for Polish, 1 for Russian, and 1 for Spanish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Version 1.0 consists of the same corpora and languages as the previous version 0.2; however, the English GUM dataset has been updated to a newer and larger version, and in the Czech/English PCEDT dataset, the train-dev-test split has been changed to be compatible with OntoNotes. Nevertheless, the main change is in the file format (the MISC attributes have a new form and interpretation).
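As a rough illustration of the storage layout described in the abstract above, the following Python sketch reads the MISC column (the tenth tab-separated field of a CoNLL-U token line) and splits it into attribute-value pairs. The function names, the sample token line, and the `Entity=(e1)` value are invented for illustration only; the actual attribute names and value syntax used by CorefUD are defined in its documentation and are not reproduced here.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC field into a dict of attribute-value pairs.

    In CoNLL-U, "_" marks an empty field and pairs are separated by "|".
    """
    if misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        # An item without "=" is kept as a key with an empty value.
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def misc_of_token_line(line):
    """Return the parsed MISC column of one CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError("a CoNLL-U token line has exactly 10 columns")
    return parse_misc(columns[9])


# Hypothetical token line; the MISC content is a placeholder, not real data.
sample = "1\tMarie\tMarie\tPROPN\t_\t_\t2\tnsubj\t_\tEntity=(e1)|SpaceAfter=No"
print(misc_of_token_line(sample))
```

Comment lines (starting with `#`) and blank sentence separators in a real CoNLL-U file would need to be skipped before calling `misc_of_token_line`.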

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

CorefUD 1.0

Author: Bourgonje Peter
Cinková Silvie
Hajič Jan
Hardmeier Christian
Krielke Pauline
Landragin Frédéric
Lapshinova-Koltunski Ekaterina
Martí M. Antònia
Mikulová Marie
Nedoluzhko Anna
Novák Michal
Ogrodniczuk Maciej
Popel Martin
Recasens Marta
Stede Manfred
Straka Milan
Toldova Svetlana
Vincze Veronika
Zeldes Amir
Zeman Daniel
Žabokrtský Zdeněk
Žitkus Voldemaras
Publication date: 01/01/2022

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.0 consists of 17 datasets for 11 languages, and compared to version 0.2, the file format has been reworked and a number of annotation errors have been fixed.

Biblio at Institute of Formal and Applied Linguistics

Coreference in Universal Dependencies 1.1 (CorefUD 1.1)

Author: Acar Kutay
Bourgonje Peter
Cebiroğlu Eryiğit Gülşen
Cinková Silvie
Hajič Jan
Hardmeier Christian
Haug Dag
Jørgensen Tollef
Krielke Pauline
Kåsen Andre
Landragin Frédéric
Lapshinova-Koltunski Ekaterina
Martí M. Antònia
Mikulová Marie
Mæhlum Petter
Nedoluzhko Anna
Novák Michal
Nøklestad Anders
Ogrodniczuk Maciej
Pamay Arslan Tuğba
Popel Martin
Recasens Marta
Solberg Per Erik
Stede Manfred
Straka Milan
Toldova Svetlana
Vadász Noémi
Velldal Erik
Vincze Veronika
Zeldes Amir
Zeman Daniel
Øvrelid Lilja
Žabokrtský Zdeněk
Žitkus Voldemaras
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 24/02/2023

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.1 consists of 21 datasets for 13 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 17 datasets for 12 languages (1 dataset for Catalan, 2 for Czech, 2 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Compared to the previous version 1.0, version 1.1 adds new languages and corpora, namely Hungarian-KorKor, Norwegian-BokmaalNARC, Norwegian-NynorskNARC, and Turkish-ITCC. In addition, the English GUM dataset has been updated to a newer and larger version, and the conversion pipelines for most datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file).

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University